mildbyte's comments
mildbyte | 8 months ago | on: Solving Wordle with uv's dependency resolver
mildbyte | 1 year ago | on: Show HN: Pg_analytica – Speed up queries by exporting tables to columnar format
pg_analytics uses DataFusion (a dedicated analytical query engine) to run the entire query, which can achieve orders-of-magnitude speedups over vanilla PG with indexes on analytical benchmarks like TPC-H. We use the same approach at EDB for our Postgres Lakehouse (I'm part of the team that works on it).
mildbyte | 2 years ago | on: LLaVA-1.6: Improved reasoning, OCR, and world knowledge
[0] https://mildbyte.xyz/blog/llama-cpp-python-llava-gpu-embeddi...
mildbyte | 3 years ago | on: I Migrated from a Postgres Cluster to Distributed SQLite with LiteFS
[0] https://github.com/splitgraph/seafowl/blob/main/examples/lit...
mildbyte | 3 years ago | on: Databases in 2022: A Year in Review
We're currently doing a lot of work at Splitgraph to reposition the product around this "analytics at the edge" use case, with usage-based billing, eventually moving our query execution from PostgreSQL to Seafowl.
mildbyte | 3 years ago | on: PostgREST – Serve a RESTful API from any Postgres database
You might be interested in what we're building: Seafowl, a database designed for running analytical SQL queries straight from the user's browser, with HTTP CDN-friendly caching [0]. It's a second iteration of the Splitgraph DDN [1] which we built on top of PostgreSQL (Seafowl is much faster for this use case, since it's based on Apache DataFusion + Parquet).
The tradeoff of allowing the client to run arbitrary SQL vs a limited API is that PostgREST-style queries have a fairly predictable and low overhead, but they aren't as powerful as fully-fledged SQL with aggregations, joins, window functions and CTEs, which are useful in interactive dashboards for reducing the amount of data that has to be processed on the client.
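To make the tradeoff concrete, here's a minimal sketch (hypothetical toy data, stdlib sqlite3 standing in for the backing database) contrasting the two access patterns: fetching raw rows and aggregating client-side vs pushing the aggregation down and fetching one row per group.

```python
import sqlite3

# Hypothetical toy data standing in for a dashboard's backing table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trips (city TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO trips VALUES (?, ?)",
    [("NYC", 10.0), ("NYC", 12.5), ("SF", 7.0), ("SF", 8.0), ("SF", 9.0)],
)

# Limited-API style: fetch raw rows, aggregate on the client.
raw_rows = conn.execute("SELECT city, amount FROM trips").fetchall()

# Full-SQL style: push the aggregation down, fetch one row per group.
agg_rows = conn.execute(
    "SELECT city, COUNT(*), SUM(amount) FROM trips GROUP BY city ORDER BY city"
).fetchall()

print(len(raw_rows))  # 5 rows shipped to the client
print(agg_rows)       # [('NYC', 2, 22.5), ('SF', 3, 24.0)]
```

With real dashboard tables the raw-rows path ships orders of magnitude more data than the two aggregated rows.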
There's also ROAPI [2], a read-only SQL API that you can deploy in front of a database or another data source (though when a database is the source, it only works for tables that fit in memory).
mildbyte | 3 years ago | on: Show HN: Socrata Roulette – run random SQL on a random government dataset
Technically, the ability is already there (you can add a dataset to Splitgraph and select Socrata as a source if you know the dataset ID), but it's not as turnkey as landing on a dataset page and clicking a button. More to come!
mildbyte | 3 years ago | on: Command-line data analytics
FWIW, we use the Parquet format with DataFusion and get very good speeds, similar to DuckDB [2], e.g. 1.5s to run a more complex aggregation query `SELECT date_trunc('month', tpep_pickup_datetime) AS month, COUNT(*) AS total_trips, SUM(total_amount) FROM tripdata GROUP BY 1 ORDER BY 1 ASC` on a 55M row subset of NY Taxi trip data.
[0]: https://github.com/apache/arrow-datafusion/blob/master/dataf...
[1]: https://github.com/roapi/roapi/releases/tag/roapi-v0.8.0
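For reference, the shape of that aggregation can be reproduced with nothing but the stdlib (hypothetical three-row toy data; SQLite has no `date_trunc`, so `strftime('%Y-%m', ...)` plays the same month-truncation role here):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tripdata (tpep_pickup_datetime TEXT, total_amount REAL)")
conn.executemany(
    "INSERT INTO tripdata VALUES (?, ?)",
    [
        ("2022-01-05 08:00:00", 10.0),
        ("2022-01-20 09:30:00", 15.0),
        ("2022-02-02 11:00:00", 20.0),
    ],
)

# SQLite lacks date_trunc; strftime('%Y-%m', ...) truncates to the month instead.
rows = conn.execute(
    """
    SELECT strftime('%Y-%m', tpep_pickup_datetime) AS month,
           COUNT(*) AS total_trips,
           SUM(total_amount)
    FROM tripdata
    GROUP BY 1
    ORDER BY 1 ASC
    """
).fetchall()

print(rows)  # [('2022-01', 2, 25.0), ('2022-02', 1, 20.0)]
```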
mildbyte | 3 years ago | on: IOx: InfluxData’s New Storage Engine
It's a fairly different use case from DuckDB (query execution for Web applications vs fast embedded analytical database for notebooks) and the rest of the modern data stack (which mostly is about analytics internal to a company). Just to clarify, we're not related to IOx directly (only via us both using Apache DataFusion).
If we had to place Seafowl _inside_ of the modern data stack, it'd be mostly a warehouse, but one that is optimized for being queried from the Internet, rather than by a limited set of internal users. Or, a potential use case could be extracting internal data from your warehouse to Seafowl in order to build public applications that use it.
We don't currently ship a Web front-end and so can't serve as a replacement for Superset: Seafowl is exposed to the developer as an HTTP API that can be queried directly from the end user's Web browser. But we have some ideas around a frontend component: some kind of middleware, where the Web app can pre-declare the queries it will need to run at build time and we can compute some pre-aggregations to speed those up at runtime. Currently we recommend querying it with Observable [0] for an end-to-end query + visualization experience (or using a different viz library like d3/Vega).
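The "pre-declare queries at build time" idea could look something like this minimal sketch (all names here — `DECLARED_QUERIES`, `query_key` — are hypothetical, not an actual Seafowl API): the build step hashes each declared query into a stable key that the middleware can later use to look up a pre-aggregated result.

```python
import hashlib

# Hypothetical build step: the Web app declares the queries it will run,
# and the middleware keys pre-aggregated results by a stable query hash.
DECLARED_QUERIES = {
    "monthly_trips": "SELECT month, COUNT(*) FROM trips GROUP BY month",
}

def query_key(sql: str) -> str:
    # Normalize whitespace so trivially different spellings share a cache entry.
    normalized = " ".join(sql.split())
    return hashlib.sha256(normalized.encode()).hexdigest()[:16]

manifest = {name: query_key(sql) for name, sql in DECLARED_QUERIES.items()}
print(manifest)
```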
Re: the second question about Splitgraph as a data lake: the intention behind Splitgraph is to orchestrate all those tools, and there the use case is indeed the modern data stack in a box. It's kind of similar to dbt Labs's Sinter [1], which was supposed to be the end-to-end data platform before they focused on dbt and dbt Cloud instead: being able to run Airbyte ingestion and dbt transformations, be a data warehouse (using PostgreSQL and a columnar store extension), and let users organize and discover data at the same time. There's a lot of baggage in Splitgraph though, as we moved through a few iterations of the product (first Git/Docker for data, then a platform for the modern data stack). Currently we're thinking about how to best integrate Splitgraph and Seafowl in order to build a managed pay-as-you-go Seafowl, kind of like Fauna [2] for analytics.
Hope this helps!
[0] https://observablehq.com/@seafowl/interactive-visualization-...
mildbyte | 3 years ago | on: IOx: InfluxData’s New Storage Engine
It's a framework for writing query engines in Rust that takes care of a lot of heavy lifting around parsing SQL, type casting, constructing and transforming query plans and optimizing them. It's pluggable, making it easy to write custom data sources, optimizer rules, query nodes etc.
It has very good single-node performance (there's even a way to compile it with SIMD support), and Ballista [1] extends that to build it into a distributed query engine.
Plenty of other projects use it besides IOx, including VegaFusion, ROAPI and Cube.js's pre-aggregation store. We're heavily using it to build Seafowl [2], an analytical database that's optimized for running SQL queries directly from the user's browser (caching, CDNs, low latency, some WASM support, all that fun stuff).
[0] https://github.com/apache/arrow-datafusion
mildbyte | 3 years ago | on: CloudFront vs. Cloudflare, and how to reduce response times for both (2021)
mildbyte | 3 years ago | on: Show HN: Seafowl – CDN-friendly analytical database
A new project from us at Splitgraph: Seafowl, a database that's optimized for Web applications running analytical SQL queries straight from the user's browser. Used to power interactive visualizations and dashboards. Features:
- Fast: written in Rust and uses Apache DataFusion. About 5-10x faster than PostgreSQL (some benchmarks available at [1])
- Light: single 50MB binary that starts in 10ms
- Extensible: write user-defined functions in anything that compiles to WASM
- Cache-friendly: REST API designed to work well with CDNs like Cloudflare or caches like Varnish (as well as the user's browser cache).
- Demo of Seafowl providing data to an Observable notebook here [2] (press F12 and refresh the page to see caching in action)
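As a sketch of what "cache-friendly" means in practice (the `/q/<hash>` URL shape below is illustrative, not necessarily Seafowl's exact API): deriving the request URL from a hash of the normalized SQL text lets a CDN or Varnish key cached results on the path alone.

```python
import hashlib
import urllib.parse

# Hypothetical cache-friendly read endpoint: the URL is derived from a hash
# of the (normalized) SQL text, so a CDN can key its cache on the path alone.
def cached_query_url(base: str, sql: str) -> str:
    normalized = " ".join(sql.split()).encode()
    digest = hashlib.sha256(normalized).hexdigest()
    return urllib.parse.urljoin(base, f"/q/{digest}")

url = cached_query_url("https://demo.example.org", "SELECT COUNT(*) FROM tripdata")
print(url)
```

The same query always maps to the same URL, so repeat visitors (and everyone behind the same CDN edge) get the cached result.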
Happy to answer any questions!
[1] https://observablehq.com/@seafowl/benchmarks
[2] https://observablehq.com/@seafowl/interactive-visualization-...
mildbyte | 3 years ago | on: Litestream live replication has been moved to the LiteFS project
- rare writes that get directed to a single instance (e.g. using Fly.io's replay header), frequent reads (potentially at edge locations)
- no need to deploy a PostgreSQL cluster and set up logical replication
- SQLite database stored in object storage, reader replicas can boot up using the object storage copy and then get kept in sync by pulling data from the writer
- delay in replication is fine
LiteFS is probably going to be a great solution here, since we're mainly using Fly.io and it has built-in support for it [1], but are there any alternatives that don't require Consul, still look like an SQLite database to the client and can work off an HTTP connection to the primary, so that we don't have to require our users to deploy to Fly?
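The "boot from the object storage copy" step in the list above can be sketched with the stdlib alone (everything here — `fetch_db`, `boot_replica` — is hypothetical scaffolding; a real replica would additionally keep pulling WAL changes from the writer):

```python
import os
import sqlite3
import tempfile

# Build a snapshot file to stand in for the copy in object storage.
tmp = tempfile.mkdtemp()
snapshot_path = os.path.join(tmp, "snapshot.db")
src = sqlite3.connect(snapshot_path)
src.execute("CREATE TABLE users (id INTEGER, name TEXT)")
src.execute("INSERT INTO users VALUES (1, 'alice')")
src.commit()
src.close()

def fetch_db() -> bytes:
    # Stand-in for an HTTP GET against the object store.
    with open(snapshot_path, "rb") as f:
        return f.read()

def boot_replica(fetch_db) -> sqlite3.Connection:
    # Materialize the snapshot locally, then open it read-only:
    # a reader replica never writes to its local copy.
    path = os.path.join(tempfile.mkdtemp(), "replica.db")
    with open(path, "wb") as f:
        f.write(fetch_db())
    return sqlite3.connect(f"file:{path}?mode=ro", uri=True)

replica = boot_replica(fetch_db)
rows = replica.execute("SELECT * FROM users").fetchall()
print(rows)  # [(1, 'alice')]
```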
mildbyte | 3 years ago | on: Really divisionless random numbers
mildbyte | 3 years ago | on: Hosting SQLite databases on any static file hoster (2021)
To clarify, Seafowl itself can't be hosted statically (it's a Rust server-side application), but it works well for statically hosted pages. It's basically designed to run analytical SQL queries over HTTP, with SQL query results cached by a CDN or Varnish. The user of the Web page downloads just the query result, rather than the required fragments of the database (which, if you're running aggregation queries, might mean scanning through a large part of it).