top | item 45127389

(no title)

mrtimo | 5 months ago

I agree with this 100%. The creator of duckdb argues that people using pandas are missing out of the 50 years of progress in database research, in the first 5 minutes of his talk here [1].

I've been using Malloy [2], which compiles to SQL (like Typescript compiles to Javascript), so instead of editing a 1000 line SQL script, it's only 18 lines of Malloy.

I'd love to see a blog post comparing a pandas approach to cleaning to an SQL/Malloy approach.

[1] https://www.youtube.com/watch?v=PFUZlNQIndo [2] https://www.malloydata.dev/

discuss

order

orlp|5 months ago

> The creator of duckdb argues that people using pandas are missing out of the 50 years of progress in database research, in the first 5 minutes of his talk here.

That's pandas. Polars builds on much of the same 50 years of progress in database research by offering a lazy DataFrame API which does query optimization, morsel-based columnar execution, predicate pushdown into file I/O, etc, etc.

Disclaimer: I work for Polars on said query execution.

phailhaus|5 months ago

The DataFrame interface itself is the problem. It's incredibly hard to read, write, debug, and test. Too much work has gone into reducing keystrokes rather than developing a better tool.

entropicdrifter|5 months ago

Just wanted to say I'm a huge fan of your work. Been using Polars for my team's main project for years and it just keeps getting better.

fumeux_fume|5 months ago

In the same talk, Mark acknowledges that "for data science workflows, database systems are frustrating and slow." Granted DuckDB is an attempt to fix that, most data scientists don't get to choose what database the data is stored in.

willvarfar|5 months ago

(I use duckdb to query data stored in parquet files)

esafak|5 months ago

Have you used Malloy in a pipeline, e.g., with Airflow? If so, how was the experience?