As someone who mainly uses pandas, what is the benefit of using DuckDB to write your queries over using pandas (or polars) to operate on the data? Is it that you can subset the data without first loading it all into memory?
I use DuckDB because I can express complex analytic queries better using a SQL mental model. Most software people hate SQL because they can never remember its syntax, but for data scientists, SQL lets us express our thoughts more simply and precisely in a declarative fashion — which gets us query plan optimization as a plus.
wenc|2 years ago
People have been trying to get rid of SQL for years, yet they only end up reinventing it badly.
I’ve written a lot of code and the two notations I always gravitate toward are the magrittr + dplyr pipeline notation and SQL.
Method-chaining notation is too unergonomic, especially for expressing window functions and complex joins.
Spark started out with method chaining but eventually found that most people used Spark SQL.
phoobahr|2 years ago
In addition to this, here's one really specific case: ever had a pandas groupby().apply() that took forever, often mostly re-aggregating after the apply? With columnar data, DuckDB is so much faster at this.
For one of my projects I have what sounds like a dumb workflow:
- JSON API fetches get cached in sqlite3
- Parsing the JSON gets done with sqlite3's JSON operators (Fast! Fault tolerant! Handles NULLs nicely! Fast!!).
- Collating happens later with duckdb: everything gets munged and aggregated into the shape I want and persisted in parquet files
- When it's time to consume it, duckdb queries my various sources, does my (used-to-be-expensive) groupbys on the fly, and spits out pandas data frames
- Lastly those data frames are small-ish, tidy and flexible
So yeah, on paper it sounds like these three libraries overlap too much to be used at the same time, but in practice they each have their place and interact well.