top | item 35455672

phoobahr | 2 years ago

In addition to this, here's one really specific case: ever had a pandas groupby().apply() that took forever, often mostly because it was re-aggregating after the apply?

With columnar data, DuckDB is so much faster at this.

For one of my projects I have what sounds like a dumb workflow:

- JSON API fetches get cached in sqlite3
- Parsing the JSON gets done with sqlite3 JSON operators (Fast! Fault tolerant! Handles NULLs nicely! Fast!!)
- Collating the data happens later with duckdb queries: everything gets munged and aggregated into the shape I want and is persisted in parquet files
- When it's time to consume it, duckdb queries my various sources, does my (used to be expensive) groupbys on the fly and spits out pandas data frames
- Lastly, those data frames are small-ish, tidy and flexible

So yeah, on paper it sounds like these 3 libraries overlap too much to be used at the same time, but in practice they can each have their place and interact well.
